Web Page Classification Using Relational Learning Algorithm and Unlabeled Data

نویسندگان

  • Yanjuan Li
  • Maozu Guo
چکیده

Applying relational tri-training (R-tri-training for short) to web page classification is investigated in this paper. R-tri-training, as a new relational semi-supervised learning algorithm, is well suitable for learning in web page classification. The semi-supervised component of R-tritraining allows it to exploit unlabeled web pages to enhance the learning performance effectively. In addition, the relational component of R-tri-training is able to describe how the neighboring web pages are related to each other by hyperlinks. Experiments on Web-Kb dataset show that: 1) a large amount of unlabeled web pages (the unlabeled data) can be used by R-tri-training to enhance the performance of the learned hypothesis; 2) the performance of R-tri-training is better than the other algorithms compared with it.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...

متن کامل

Using unlabeled data to improve classification in the naive bayes approach: Application to web searches

This paper introduces a method to build a classifier based on labeled and unlabeled data. We set up the EM algorithm steps for the particular case of the naive Bayes approach and show empirical work for the restricted web page database. Original contributions includes the application of the EM algorithm to simulated data in order to see the behavior of the algorithm for different numbers of lab...

متن کامل

Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Support Vector Machines present an interesting and effective approach to solve automated classification tasks. Although it only handles binary and supervised problems by nature, it has been transformed into multiclass and semi-supervised approaches in several works. A previous study on supervised and semi-supervised SVM classification over binary taxonomies showed how the latter clearly outperf...

متن کامل

Combining Labeled and Unlabeled Data with Co - Training yAvrim

We consider the problem of using a large unla-beled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a setting in which the description of each example can be partitioned into two distinct views, motivated by the task of learning to classify web pages. For example, the description of a web page can be partitio...

متن کامل

Relational Learning

Most of the content-based approaches to text and web document classification explored in other related projects are based on the bag of words model, well known from the area of Information Retrieval. This model is simple and efficient, but fails to capture many additional document features such as the internal HTML structure, language structure and inter-document link structure. All this howeve...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • JCP

دوره 6  شماره 

صفحات  -

تاریخ انتشار 2011